Data Frames, Plotting and Dates
1 Adding Color to Plots
Color is often referred to as the third dimension of a 2-dimensional plot, because
it allows us to add extra information to an ordinary scatterplot. Consider the
graph of literacy and income. By examining boxplots, we can see that there are
differences among the distributions of income (and literacy) for the different
continents, and it would be nice to display some of that information on a
scatterplot. This is one situation where factors come in very handy. Since
factors are stored internally as numbers (starting at 1 and going up to the number
of unique levels of the factor), it's very easy to assign different observations
different colors based on the value of a factor variable.
To illustrate, let's replot the income vs. literacy graph, but this time we'll
convert the continent into a factor and use it to decide on the color of the
points that will be used for each country. First, consider the world1
data frame. In that data frame, the continent is stored in the column (variable)
called cont. We convert this variable to a factor with the factor
function. First, let's look at the mode and class of the variable before we
convert it to a factor:
> mode(world1$cont)
[1] "character"
> class(world1$cont)
[1] "character"
> world1$cont = factor(world1$cont)
In many situations, the cont variable will behave the same as it did
when it was a simple character variable, but notice that its mode and
class have changed:
> mode(world1$cont)
[1] "numeric"
> class(world1$cont)
[1] "factor"
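The following is a minimal sketch of what the conversion does internally. The vector cont here is a made-up stand-in for world1$cont, not the real data:

```r
# Hypothetical continent codes standing in for world1$cont
cont <- c("EU", "AF", "AS", "EU", "SA", "AF")
f <- factor(cont)

levels(f)                 # the unique values, sorted
as.integer(f)             # the integer code stored for each observation
levels(f)[as.integer(f)]  # the codes mapped back to their labels
```

The integer codes are what make a factor usable as a subscript into another vector.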
Having made cont into a factor, we need to choose some colors to
represent the different continents. There are a few ways to tell R what
colors you want to use. The easiest is to just use a color's name. Most
colors you think of will work, but you can run the colors function
without an argument to see the official list. You can also use the method
that's commonly used by web designers, where colors are specified as a pound
sign (#) followed by three pairs of hexadecimal digits providing the levels
of red, green and blue, respectively. Using this scheme, red is represented as
'#FF0000', green as '#00FF00', and blue as '#0000FF'. To see how many unique
values of cont there are, we can use the levels function,
since it's a factor. (For non-factors, the unique function is available,
but it may give the levels in an unexpected order.)
> levels(world1$cont)
[1] "AF" "AS" "EU" "NA" "OC" "SA"
There are six levels. The first step is to create a vector of color values:
mycolors = c('red','yellow','blue','green','orange','violet')
To make the best possible graph, you should probably be more careful
when choosing the colors, but this will serve as a simple example.
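As a quick check on the two ways of specifying colors described above, the base R functions col2rgb and rgb convert between names, hexadecimal strings and red/green/blue levels. This is only an illustrative sketch; the variable name hexred is made up:

```r
# col2rgb() reports the red, green and blue levels (0-255) of a color
col2rgb("red")
col2rgb("#00FF00")

# rgb() goes the other way and builds the hexadecimal string
hexred <- rgb(255, 0, 0, maxColorValue = 255)
hexred

# a color name is recognized if it appears in the list returned by colors()
"violet" %in% colors()
```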
Now, when we make the scatterplot, we add an additional argument, col=,
which is a vector with one element for each point that we're plotting; the
color in each position is the color that will be used to draw that point
on the graph. Probably the easiest way to do that is to use
the value of the factor cont as a subscript to the mycolors
vector that we created earlier. (If you don't see why this does what we want,
please take a look at the result of mycolors[world1$cont].)
with(world1,plot(literacy,income,col=mycolors[cont]))
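To see the subscripting at work without the full data set, here is a small sketch. The factor cont below is invented for illustration, but it uses the same six levels as world1$cont:

```r
mycolors <- c('red', 'yellow', 'blue', 'green', 'orange', 'violet')

# Hypothetical factor with the same six levels as world1$cont
cont <- factor(c("EU", "AF", "OC", "AF", "SA"),
               levels = c("AF", "AS", "EU", "NA", "OC", "SA"))

# Each observation picks the color at the position of its level
pointcolors <- mycolors[cont]
pointcolors
```

Because "EU" is the third level, the first observation gets the third color, and so on.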
There's one more detail that we need to take care of. Since we're
using color on the graph, we have to provide some way that someone viewing the
graph can tell which color represents which continent, i.e. we need to add a
legend to the graph. In R, this is done with the legend command.
There are many options to this command, but in its simplest form we just tell
R where to put the legend, whether we should show points or lines, and what
colors they should be. A title for the legend can also be added, which is a
good idea in this example, because the meaning of the continent abbreviations
may not be immediately apparent. You can specify x- and y-coordinates for the legend
location or you can use one of several shortcuts like "topleft" to do things
automatically. (You may also want to look at the locator command,
which lets you decide where to place your legends interactively.) For our
example, the following will place a legend in an appropriate place; the
title command is also used to add a title to the plot:
with(world1,legend('topleft',legend=levels(cont),col=mycolors,pch=1,title='Continent'))
title('Income versus Literacy for Countries around the World')
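Here's what the plot looks like:
[Figure: scatterplot of income versus literacy, with points colored by continent and a legend titled "Continent" at the top left]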
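For readers without the world1 data frame, the whole sequence can be tried on a toy data set. Everything in the sketch below (the data frame toy, its values, and toycolors) is invented for illustration:

```r
# Hypothetical stand-in for world1; the values are invented
toy <- data.frame(literacy = c(99, 61, 93, 45, 88, 97),
                  income   = c(30000, 1500, 9000, 800, 7000, 25000),
                  cont     = factor(c("EU", "AF", "AS", "AF", "SA", "EU")))
toycolors <- c("red", "yellow", "blue", "green")   # one color per level

pdf(NULL)   # draw to a null device; drop this line to see the plot on screen
with(toy, plot(literacy, income, col = toycolors[cont]))
with(toy, legend("topleft", legend = levels(cont), col = toycolors,
                 pch = 1, title = "Continent"))
title("Income versus Literacy (toy data)")
invisible(dev.off())
```

Note that the legend uses the full color vector together with levels(cont), so the colors and labels line up in level order.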
2 Taking More Control Over Graphics
Although consulting the help file for a particular plotting function will
often yield useful information, the R graphics system relies on a general
method for setting a variety of graphical parameters through the par
function. You should definitely familiarize yourself with the capabilities of
this function before trying to customize any graphics. Two settings that
you will probably want to use are xlim= and ylim=. Unlike most of the
settings handled by par, these are passed as arguments to the plotting
function itself. Each accepts a vector of length two, giving the minimum and maximum
values that will be displayed on the x- and y-axes, respectively. For example,
suppose we are investigating the relationship between income and military spending
in the world1 data frame:
> plot(world1$income,world1$military)
The problem is that the large outlier for military spending makes it very
difficult to see the relationships among the other points. To resolve this
problem, we can replot the graph, using the ylim= argument to
restrict the y-axis to the range from 0 to 1e+11:
plot(world1$income,world1$military,ylim=c(0,1e11))
Many other graphics parameters exist to control things like the size and
spacing of axis labels, the number of tick marks on the axes, the size of
your plot and many other details.
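As a small sketch of how par is typically used, the example below changes two settings, draws two plots, and then restores the previous values. The data are made up, and mfrow and cex.axis are just two of the many parameters par accepts:

```r
pdf(NULL)                       # null device so the sketch runs anywhere
x <- c(1, 2, 3, 4, 5)           # made-up data with one large outlier in y
y <- c(2, 4, 3, 5, 500)

# par() returns the previous settings, so they can be restored later
oldpar <- par(mfrow = c(1, 2), cex.axis = 0.8)
plot(x, y)                      # the outlier dominates the y-axis
plot(x, y, ylim = c(0, 10))     # the outlier falls outside the plot region
par(oldpar)                     # restore the saved settings
restored <- par("mfrow")
invisible(dev.off())
```

Saving the return value of par and passing it back later is a common way to avoid leaving changed settings behind for subsequent plots.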
3 Using Dates in R
Dates on computers have been the source of much anxiety, especially at the
turn of the century, when people felt that many computers wouldn't understand
the new millennium. These fears were based on the fact that certain programs
would store the value of the year in just 2 digits, causing great confusion
when the century "turned over". In R, dates are stored as they have traditionally
been stored on Unix computers - as the number of days from a reference date,
in this case January 1, 1970, with earlier days being represented by negative
numbers. When dates are stored this way, they can be manipulated like any other
numeric variable (as far as it makes sense). In particular, you can compare or
sort dates, take the difference between two dates, or add an increment of days,
weeks, months or years to a date. The class of such dates is Date and
their mode is numeric. Dates are created with as.Date, and formatted for
printing with format (which will recognize dates and do the right
thing).
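The operations described above can be tried directly. This is a minimal sketch using two arbitrary dates; stepping by months is shown with seq, which is one way (assumed here, not the only one) to add calendar increments:

```r
d1 <- as.Date("1997-12-19")
d2 <- as.Date("2009-12-18")

as.numeric(as.Date("1970-01-02"))       # days since the reference date
d2 - d1                                 # difference between two dates, in days
d1 + 7                                  # one week later
d1 < d2                                 # comparison works as for numbers
seq(d1, by = "month", length.out = 3)   # step forward a month at a time
class(d1)
mode(d1)
```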
Because dates can be written in so many different formats, R uses a
standard way of providing flexibility when reading or displaying dates. A
set of format codes, some of which are shown in the table below, is used to
describe what the input or output form of the date looks like. The default
format for
as.Date is a four digit year, followed by a month, then a day, separated
by either dashes or slashes. So conversions like this happen automatically:
> as.Date('1915-6-16')
[1] "1915-06-16"
> as.Date('1890/2/17')
[1] "1890-02-17"
The formatting codes are as follows:
Code | Value |
%d | Day of the month (decimal number) |
%m | Month (decimal number) |
%b | Month (abbreviated) |
%B | Month (full name) |
%y | Year (2 digit) |
%Y | Year (4 digit) |
(For a complete list of the format codes, see the R help page for
the strptime function.)
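A short sketch of the codes in use, restricted to the numeric codes so that the results do not depend on the language settings of the machine. The input strings are invented examples:

```r
# numeric day/month/year, separated by slashes
d <- as.Date("16/06/1915", format = "%d/%m/%Y")
d

# two-digit year; see ?strptime for how the century is chosen
as.Date("06-16-15", format = "%m-%d-%y")

# the same codes control output
format(d, "%d.%m.%Y")
format(d, "%y")
```

The codes %b and %B work the same way, but the month names they match or produce come from the current locale.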
As an example of reading dates, the URL http://www.stat.berkeley.edu/classes/s133/data/movies.txt
contains the names, release dates, and box office earnings for around 700 of the
most popular movies of all time. The first few lines of the input file look like
this:
Rank|name|box|date
1|Titanic|$600.788|December 19, 1997
2|Avatar|$594.472|December 18, 2009
3|The Dark Knight|$529.143|July 18, 2008
As can be seen, the fields are separated by vertical bars, so
we can use read.delim with the appropriate sep= argument.
> movies = read.delim('http://www.stat.berkeley.edu/classes/s133/data/movies.txt',
+ sep='|',stringsAsFactors=FALSE)
> head(movies)
  Rank                               name      box              date
1    1                            Titanic $600.788 December 19, 1997
2    2                             Avatar $594.472 December 18, 2009
3    3                    The Dark Knight $529.143     July 18, 2008
4    4 Star Wars: Episode IV - A New Hope $460.998      May 25, 1977
5    5                            Shrek 2 $436.471      May 19, 2004
6    6         E.T. the Extra-Terrestrial $433.005     June 11, 1982
The first step in using a data frame is making sure that we know what we're
dealing with. A good way to start is to use the sapply function to look
at the mode of each of the variables:
> sapply(movies,mode)
       Rank        name         box        date
  "numeric" "character" "character" "character"
Unfortunately, the box office receipts (box) are character, not
numeric. That's because R doesn't recognize a dollar sign ($) as being part
of a number. (R has the same problem with commas.) We can remove the dollar sign with the sub function, and
then use as.numeric to make the result into a number:
> movies$box = as.numeric(sub('\\$','',movies$box))
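Since sub replaces only the first match in each string, values containing commas need gsub instead. The sketch below uses made-up amounts, not values from the movies file:

```r
# Made-up amounts for illustration
amounts <- c("$600.788", "$1,234.50", "$12")

# sub() removes only the first match in each string
as.numeric(sub("\\$", "", amounts[1]))

# gsub() removes every match; the character class [$,] covers both symbols
cleaned <- as.numeric(gsub("[$,]", "", amounts))
cleaned
```

Inside a character class the dollar sign is taken literally, so it does not need the double backslash there.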
To convert the character date values to R Date objects, we can
use as.Date with the appropriate format. In this case it's the month
name followed
by the day of the month, a comma and the four digit year. Consulting the table
of format codes, this translates to '%B %d, %Y':
> movies$date = as.Date(movies$date,'%B %d, %Y')
> head(movies$date)
[1] "1997-12-19" "2009-12-18" "2008-07-18" "1977-05-25" "2004-05-19"
[6] "1982-06-11"
The format that R now uses to print the dates is the standard Date format,
letting us know that we've done the conversion correctly. (If we wanted to recover
the original format, we could use the format function with a format similar
to the one we used to read the data.)
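A minimal sketch of that round trip, using one date on its own rather than the movies data frame. The names txt and back are made up for the example:

```r
d <- as.Date("1997-12-19")

# Write the date out in the "month day, year" layout of the input file.
# Month names come from the current locale, so the text may not be English.
txt <- format(d, "%B %d, %Y")
txt

# Reading the text back with the same format recovers the original date
back <- as.Date(txt, "%B %d, %Y")
back == d
```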
Another way to create dates is with the ISOdate function. This function
accepts three numbers representing the year, month and day of the date that is
desired. So to reproduce the last date in the previous vector, we could use
> lastdate = ISOdate(1982,6,11)
> lastdate
[1] "1982-06-11 12:00:00 GMT"
Notice that, along with the date, a time is printed. That's because
ISOdate returns a date-time object of class POSIXct (which inherits from
POSIXt), not Date.
To make a date like this work properly with objects of class Date, you
can use the as.Date function.
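A brief sketch of that conversion, using the June 11, 1982 release date from the listing above. The variable names stamp and asdate are made up:

```r
stamp <- ISOdate(1982, 6, 11)   # a date-time at 12:00 GMT
class(stamp)

asdate <- as.Date(stamp)        # keep the date, drop the time of day
class(asdate)
asdate == as.Date("1982-06-11")
```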
Once we've created an R Date value, we can use the functions months,
weekdays or quarters to extract those parts of the date. For example,
to see which day of the week these very popular movies were released, we could use
the table function combined with weekdays:
> table(weekdays(movies$date))
   Friday    Monday  Saturday    Sunday  Thursday   Tuesday Wednesday
      738        13         9         9        42        24       165
Notice that the ordering of the days is not what we'd normally expect. This
problem can be solved by creating a factor that has the levels in the
correct order:
> movies$weekday = factor(weekdays(movies$date),
+ levels = c('Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday'),ordered=TRUE)
Now we can use weekday to get a nicer table:
> table(movies$weekday)
   Monday   Tuesday Wednesday  Thursday    Friday  Saturday    Sunday
       13        24       165        42       738         9         9
Similarly, if we wanted to graphically display a chart showing which month of
the year the popular movies were released in, we could first create an
ordered factor, then use the barplot function:
> movies$month = factor(months(movies$date),
+ levels=c('January','February','March','April','May','June',
+ 'July','August','September','October','November','December'),ordered=TRUE)
> barplot(table(movies$month))
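R also provides the twelve English month names as the built-in constant month.name, which saves typing the levels by hand. The sketch below is one possible variant, not the approach used above: it goes through the month number so that it does not depend on the language in which months reports names. The names dates, monthnum, mon and counts are made up:

```r
# Three of the release dates shown earlier
dates <- as.Date(c("1997-12-19", "2008-07-18", "1977-05-25"))

# Month number from the date, then the matching English name
monthnum <- as.integer(format(dates, "%m"))
mon <- factor(month.name[monthnum], levels = month.name, ordered = TRUE)

counts <- table(mon)
names(counts)[as.vector(counts) > 0]   # months appear in calendar order
```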
To do a similar thing with years, we'd have to create a new variable that represented
the year using the format function. For a four digit year the format
code is %Y, so we could make a table of the hit movies by year like this:
> table(format(movies$date,'%Y'))
1938 1939 1940 1942 1946 1950 1953 1955 1956 1959 1961 1963 1964 1965 1967 1968
1 1 1 1 1 1 1 1 1 1 1 1 2 3 3 2
1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984
1 5 2 3 2 7 3 5 4 6 11 10 8 11 14 11
1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000
11 12 15 15 24 20 23 28 24 20 35 28 34 41 41 46
2001 2002 2003 2004 2005 2006 2007 2008 2009
47 49 56 58 50 61 46 51 40
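Because format returns character strings, the year has to be converted to a number before doing arithmetic with it. As a sketch, the following groups a few of the release dates shown earlier by decade; the names dates, year and decade are made up for the example:

```r
dates <- as.Date(c("1997-12-19", "2009-12-18", "2008-07-18", "1977-05-25"))

# format() gives character strings, so convert before doing arithmetic
year <- as.numeric(format(dates, "%Y"))
decade <- 10 * (year %/% 10)    # integer division drops the last digit
table(decade)
```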
File translated from TeX by TTH, version 3.67, on 1 Feb 2010, 13:18.